    Learning Models over Relational Data using Sparse Tensors and Functional Dependencies

    Integrated solutions for analytics over relational databases are of great practical importance, as they avoid the costly loop data scientists repeat on a daily basis: select features from data residing in relational databases using feature extraction queries involving joins, projections, and aggregations; export the training dataset defined by such queries; convert this dataset into the format of an external learning tool; and train the desired model using this tool. These integrated solutions are also fertile ground for theoretically fundamental and challenging problems at the intersection of relational and statistical data models. This article introduces a unified framework for training and evaluating a class of statistical learning models over relational databases. This class includes ridge linear regression, polynomial regression, factorization machines, and principal component analysis. We show that, by synergizing key tools from database theory (schema information, query structure, functional dependencies, and recent advances in query evaluation algorithms) and from linear algebra (tensor and matrix operations), one can formulate relational analytics problems and design efficient, query- and data-structure-aware algorithms to solve them. This theoretical development informed the design and implementation of the AC/DC system for structure-aware learning. We benchmark the performance of AC/DC against R, MADlib, libFM, and TensorFlow. For typical retail forecasting and advertisement planning applications, AC/DC learns polynomial regression models and factorization machines with at least the same accuracy as these competitors, and up to three orders of magnitude faster, whenever the competitors do not run out of memory, exceed a 24-hour timeout, or encounter internal design limitations.
    Comment: 61 pages, 9 figures, 2 tables
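    Since the abstract centers on computing a model's sufficient statistics directly over the join, a small sketch may help. Everything below is hypothetical (toy relations R and S, two features, a shared join key b) and only illustrates the general aggregate-pushdown idea in Python; it is not AC/DC's algorithm, which additionally exploits functional dependencies and factorized query plans.

```python
import numpy as np
from collections import defaultdict

# A minimal sketch (not AC/DC's actual code): the aggregates needed for
# ridge regression, X^T X and X^T y, can be pushed past a join instead
# of materializing the training matrix X. Toy relations R(b, x1) and
# S(b, x2, y) are joined on key b; every R-row pairs with every S-row
# sharing its b value.
R = [(0, 1.0), (0, 2.0), (1, 3.0)]
S = [(0, 10.0, 0.5), (1, 20.0, 1.5), (1, 30.0, 2.5)]

# Per-key partial aggregates: COUNT, SUM(x), SUM(x^2), plus SUM(y) and
# SUM(x*y) on the side that carries the label.
aggR = defaultdict(lambda: [0, 0.0, 0.0])
for b, x1 in R:
    a = aggR[b]
    a[0] += 1; a[1] += x1; a[2] += x1 * x1
aggS = defaultdict(lambda: [0, 0.0, 0.0, 0.0, 0.0])
for b, x2, y in S:
    a = aggS[b]
    a[0] += 1; a[1] += x2; a[2] += x2 * x2; a[3] += y; a[4] += x2 * y

# Each entry of X^T X and X^T y over the join decomposes into a sum of
# per-key products of the partial aggregates.
g11 = g12 = g22 = c1 = c2 = 0.0
for b in aggR.keys() & aggS.keys():
    cR, sR, qR = aggR[b]
    cS, sS, qS, sy, sxy = aggS[b]
    g11 += qR * cS      # SUM(x1*x1) over the join
    g12 += sR * sS      # SUM(x1*x2) over the join
    g22 += cR * qS      # SUM(x2*x2) over the join
    c1  += sR * sy      # SUM(x1*y)  over the join
    c2  += cR * sxy     # SUM(x2*y)  over the join

# Solve the ridge normal equations (X^T X + lam*I) theta = X^T y
# without ever having built X.
lam = 1e-3
gram = np.array([[g11, g12], [g12, g22]])
corr = np.array([c1, c2])
theta = np.linalg.solve(gram + lam * np.eye(2), corr)

# Sanity check against explicitly materializing the join.
J = [(x1, x2, y) for b1, x1 in R for b2, x2, y in S if b1 == b2]
X = np.array([(x1, x2) for x1, x2, _ in J])
yv = np.array([y for *_, y in J])
print(np.allclose(gram, X.T @ X), np.allclose(corr, X.T @ yv))  # True True
```

    The payoff is that the partial aggregates are linear in the sizes of R and S, while the materialized join can be much larger; this is the asymptotic gap the structure-aware approach exploits.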

    Efficiently decodable non-adaptive group testing

    We consider the following "efficiently decodable" non-adaptive group testing problem. There is an unknown string x ∈ {0,1}ⁿ with at most d ones in it. We are allowed to test any subset S ⊆ [n] of the indices. The answer to the test tells whether xᵢ = 0 for all i ∈ S or not. The objective is to design as few tests as possible (say, t tests) such that x can be identified as fast as possible (say, in poly(t) time). Efficiently decodable non-adaptive group testing has applications in many areas, including data stream algorithms and data forensics. A non-adaptive group testing strategy can be represented by a t × n matrix, which is the stacking of all the characteristic vectors of the tests. It is well known that if this matrix is d-disjunct, then any test outcome corresponds uniquely to an unknown input string. Furthermore, we know how to construct d-disjunct matrices with t = O(d² log n) efficiently. However, these matrices so far only allow for a "decoding" time of O(nt), which can be exponentially larger than poly(t) for relatively small values of d. This paper presents a randomness-efficient construction of d-disjunct matrices with t = O(d² log n) that can be decoded in time poly(d) · t log² t + O(t²). To the best of our knowledge, this is the first result that achieves an efficient decoding time and matches the best known O(d² log n) bound on the number of tests. We also derandomize the construction, which results in a polynomial-time deterministic construction of such matrices when d = O(log n / log log n). A crucial building block in our construction is the notion of (d, ℓ)-list disjunct matrices, which represent the more general "list group testing" problem, whose goal is to output fewer than d + ℓ positions in x, including all the (at most d) positions that have a one in them. List disjunct matrices turn out to be interesting objects in their own right and were also considered independently by [Cheraghchi, FCT 2009]. We present connections between list disjunct matrices, expanders, dispersers, and disjunct matrices. List disjunct matrices have applications in constructing (d, ℓ)-sparsity separator structures [Ganguly, ISAAC 2008] and in constructing tolerant testers for Reed-Solomon codes in the data stream model.
    Funding: David & Lucile Packard Foundation; Center for Massive Data Algorithmics (MADALGO); National Science Foundation (U.S.) grants CCF-0728645 and CCF-0347565; NSF CAREER Award CCF-0844796.
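    For context on the O(nt) "naive" decoding that the paper improves on, here is a minimal Python sketch. It is not the paper's construction; the test matrix below is random and therefore only likely to behave well, whereas a genuinely d-disjunct matrix would guarantee exact recovery.

```python
import numpy as np

def naive_decode(M, outcomes):
    """O(nt) decoder for a group testing matrix M (t x n, 0/1 entries).

    Declare item i positive iff every test containing i came back
    positive. If M is d-disjunct and the hidden string has at most d
    ones, every zero item appears in some negative test, so this
    recovers the hidden string exactly; for a merely random M, a few
    false positives may slip through.
    """
    t, n = M.shape
    return [i for i in range(n)
            if all(outcomes[j] for j in range(t) if M[j, i])]

# Toy example with a hypothetical random (not provably disjunct) matrix.
rng = np.random.default_rng(1)
n, t, d = 40, 25, 2
M = (rng.random((t, n)) < 0.2).astype(int)
x = np.zeros(n, dtype=int)
x[[3, 17]] = 1                       # at most d ones
outcomes = (M @ x > 0).astype(int)   # test j is positive iff it hits a one
print(naive_decode(M, outcomes))     # should contain 3 and 17
```

    The loop inspects all n columns against all t tests, which is exactly the O(nt) cost the paper's poly(d) · t log² t + O(t²) decoder avoids.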